*ICMR - Isolated Congenital Mitral Regurgitation
Theories:
Aim:
Methods used :
Descriptive Statistics
Flow chart of the study
About the data
The data set contains variables on 119 patients. The outcome variable is Cause_of_redo (qualitative information).
The data set has a total of 45 variables (after processing). Some of the predictor variables which are going to be used in the multinomial regression model are:
Data Visualization
nnet
\(\underline{Step \hspace{.05in} 1}\): Relevel the baseline groups
data$Cause_of_redo <- factor(data$Cause_of_redo, levels = c("NONE", "MR", "MS"))
data$Age_group <- factor(data$Age_group, levels = c('0-4 yrs', '5-9 yrs', '10-14 yrs', '15-18 yrs'))
data$GROUP <- factor(data$GROUP, levels = c("NONE", "BAND", "RING"))
data$Ring_Group <- factor(data$Ring_Group, levels = c("NONE", "<26", ">=26"))
\(\underline{Step \hspace{.05in} 2}\): Split the data into random training and testing sets.
The model will attempt to learn the relationship on the training data and be evaluated on the test data. In this case, 80% of the data is used for training and 20% for testing.
set.seed(12)
train <- sample_frac(data, 0.8)
sample_id <- as.numeric(rownames(train))
test <- data[-sample_id,]
\(\underline{Step \hspace{.05in} 3}\): Fit the model and obtain the results
model <- multinom(Cause_of_redo ~ Age_group + Sex + GROUP + Ring_Group, data = train)
summary(model)$coefficients
## (Intercept) Age_group5-9 yrs Age_group10-14 yrs Age_group15-18 yrs SexM
## MR -28.79478 1.386831 -25.70965 -18.36790 0.1247472
## MS -33.80229 2.572523 -32.35526 -13.59761 1.6581460
## GROUPBAND GROUPRING Ring_Group<26 Ring_Group>=26
## MR 26.53103 17.15680 8.551873 8.604928
## MS -11.91916 20.27758 11.654593 8.622985
\(\underline{Step \hspace{.05in} 4}\): Find the p-value of each coefficient
z <- summary(model)$coefficients/summary(model)$standard.errors
p <- (1 - pnorm(abs(z), 0, 1)) * 2
p
## (Intercept) Age_group5-9 yrs Age_group10-14 yrs Age_group15-18 yrs
## MR 0 0.17945973 0 0
## MS 0 0.01161087 0 0
## SexM GROUPBAND GROUPRING Ring_Group<26 Ring_Group>=26
## MR 0.90353658 0 0 0 0
## MS 0.09261432 0 0 0 0
Comment: Overall all coefficients (except for SexM variable and Age_group5-9) are statistically significant based on their p-values being smaller than 0.05. The model equation for the first row is then:
\[ \scriptsize \begin{eqnarray} \text{ln}\bigg(\frac{Pr(\text{type} = \text{MR})}{Pr(\text{type = NONE})}\bigg) &= \beta_{10} + \beta_{11}(\text{age = 5-9 yrs}) + \beta_{12}(\text{age = 10-14 yrs}) + \beta_{13}(\text{age = 15-18 yrs}) + \beta_{14}(\text{Sex = M}) \\ & + \beta_{15}(\text{grp = Band}) + \beta_{16}(\text{grp = Ring}) + \beta_{17}(\text{Ring_grp = "<26"}) + \beta_{18}(\text{Ring_grp = ">=26"}) \end{eqnarray} \]
For MR patients:
The model equation for the second row is: \[
\scriptsize
\begin{eqnarray}
\text{ln}\bigg(\frac{Pr(\text{type} = \text{MS})}{Pr(\text{type = NONE})}\bigg) &= \beta_{20} + \beta_{21}(\text{age = 5-9 yrs}) + \beta_{22}(\text{age = 10-14 yrs})
+ \beta_{23}(\text{age = 15-18 yrs}) + \beta_{24}(\text{Sex = M})\\
& + \beta_{25}(\text{grp = Band}) + \beta_{26}(\text{grp = Ring})
+ \beta_{27}(\text{Ring_grp = "<26"}) + \beta_{28}(\text{Ring_grp = ">=26"})
\end{eqnarray}
\]
For MS patients:
For both MR and MS,
Comments: Based on the above statistical findings, it is safe to assume that:
For patients with < 26mm (small size) CE ring, there is a higher chance of relapse caused by mitral stenosis (MS).
For patients with band, there is a higher chance of relapse caused by mitral regurgiation (MR)
Relative Risk
The ratio of the probability of choosing one outcome category over the probability of choosing the baseline category is often referred as relative risk. The output coefficients are represented in the log of odds, hence relative risk can be computed by taking the exponential of the intercepts from the linear equation.
exp(coef(model))
## (Intercept) Age_group5-9 yrs Age_group10-14 yrs Age_group15-18 yrs SexM
## MR 3.123102e-13 4.002148 6.830311e-12 1.054195e-08 1.132862
## MS 2.088588e-15 13.098830 8.877505e-15 1.243459e-06 5.249569
## GROUPBAND GROUPRING Ring_Group<26 Ring_Group>=26
## MR 3.328751e+11 28255587 5176.443 5458.495
## MS 6.661528e-06 640383534 115219.358 5557.951
A few comments:
\(\underline{Step \hspace{.05in} 5}\): Look at the predicted probabilities
head(pp <- fitted(model))
## NONE MR MS
## 1 1.0000000 3.727184e-13 3.464353e-16
## 2 0.9058297 9.417031e-02 1.260298e-20
## 3 1.0000000 3.727184e-13 3.464353e-16
## 4 0.7061825 2.938175e-01 1.286993e-19
## 5 1.0000000 3.290060e-13 6.599310e-17
## 6 1.0000000 3.727184e-13 3.464353e-16
The probability of \(n^{th}\) obs being “NONE”, “MR” or “MS” shown in the above table can be interpreted in percentage. For example, the probability of the first observation being “NONE” is 100%, it being “MR” is 0.0% and and it being “MS” is 0.0%. Thus we can conclude that the patient from this observation did not have to have a re-operation.
Training vs Test sets
train$predicted <- predict(model, newdata = train, "class")
# Building classification table
ctable <- table(train$Cause_of_redo, train$predicted)
# Calculating accuracy - sum of diagonal elements divided by total obs
round((sum(diag(ctable))/sum(ctable))*100, 2)
## [1] 86.32
test$predicted <- predict(model, newdata = test, "class")
# Building classification table
ctable2 <- table(test$Cause_of_redo, test$predicted)
# Calculating accuracy - sum of diagonal elements divided by total obs
round((sum(diag(ctable2))/sum(ctable2))*100, 2)
## [1] 87.5
Comment: Accuracy in training dataset is 86.32% and the accuracy of the test set is 87.5% , which is slightly higher than that of the training data. This is not ideal since test accuracy should not be higher than that of training as the model is optimized for the latter. The model performs well overall because it performs well not only on the training data but also on the test (unseen) data.
Summary
Age group effect plot
Comment: The plot shows the difference in the average age trajectory between the “NONE”, “MR” and “MS” groups, with the fitted response line for the “NONE” group being significantly above the latter. The fitted probability lines for for “MR” and “MS” are identical in 10-14 yrs and 15-18 yrs age groups, and different in 0-4 yrs and 5-9 where the MS line is higher.
survivalsurvminerMethod: Kaplan-Meier non-parametric survival estimate. Kaplan-Meier curves are especially useful when the predictor variable is categorical (e.g.: treatment A vs treatment B; males vs females).
Measures of interest: time to relapse
Results
Redo variable is a two level (Yes and No) categorical variable - whether there is a re-operation done for each individual patient. It is used as an indicator to determine if an individual is censored:
\[\text{Redo} \begin{cases} * \textbf{Yes} \hspace{.05in} \text{to re-operation} : \textbf{not censored}\\ * \textbf{No} \hspace{.05in} \text{to re-operation} : \textbf{censored}\\ \end{cases}\]
Data with a plus sign are censored data and otherwise.
## [1] 26+ 31+ 242 240+ 228+ 96+ 27+ 37+ 111+ 154 141+ 179+ 170+ 271+ 237+
## [16] 247+ 226+ 242+ 261+ 270+
The following visualizations depict the survival rate (or disease-free rate in this case) without grouping and with grouping.
summary(fit)$table
## records n.max n.start events *rmean *se(rmean) median
## 119.00000 119.00000 119.00000 24.00000 256.57462 11.71914 308.00000
## 0.95LCL 0.95UCL
## 300.00000 NA
Comment: The horizontal axis (x-axis) represents time in months, and the vertical axis (y-axis) shows the probability of having no disease or the proportion of people having no disease. The line represent survival curves of the population. A vertical drop in the curves indicates an event. The vertical tick mark on the curves means that a patient was censored at this time.
Table of survival analysis showing the first 10 observations
| time | n.risk | n.event | n.censor | surv | upper | lower |
|---|---|---|---|---|---|---|
| 9 | 119 | 0 | 2 | 1 | 1 | 1 |
| 11 | 117 | 0 | 1 | 1 | 1 | 1 |
| 14 | 116 | 0 | 1 | 1 | 1 | 1 |
| 19 | 115 | 0 | 1 | 1 | 1 | 1 |
| 21 | 114 | 0 | 1 | 1 | 1 | 1 |
| 24 | 113 | 0 | 1 | 1 | 1 | 1 |
| 26 | 112 | 1 | 2 | 0.9911 | 1 | 0.9738 |
| 27 | 109 | 1 | 4 | 0.982 | 1 | 0.9575 |
| 31 | 104 | 0 | 2 | 0.982 | 1 | 0.9575 |
| 33 | 102 | 0 | 1 | 0.982 | 1 | 0.9575 |
Comment: The log-rank p-value for Sex variable is lower than significance level \(\alpha = 0.05\) as shown in the plot above. This is also supported by the sex variable’s p-value in multinomial logistic regression model. Therefore Sex is not a significant predictor hence has no effect on the survival/disease-free curve.
## records n.max n.start events *rmean *se(rmean) median
## Age_group=0-4 yrs 45 45 45 8 222.4189 18.22248 204
## Age_group=5-9 yrs 36 36 36 16 204.0747 17.20604 242
## Age_group=10-14 yrs 31 31 31 0 290.0000 0.00000 NA
## Age_group=15-18 yrs 7 7 7 0 290.0000 0.00000 NA
## 0.95LCL 0.95UCL
## Age_group=0-4 yrs 198 NA
## Age_group=5-9 yrs 178 NA
## Age_group=10-14 yrs NA NA
## Age_group=15-18 yrs NA NA
Comment: The log-rank p-value for Age variable is statistically significant. The survival curves show the 0-4 yrs age group to have less survival/disease-free advantage in comparison to the other groups. The median survival time for age_group = 0-4 yrs is 204 months, as opposed to 242 months for age_group = 5-9 yrs.
Further analysis to evaluate whether the differences between the age groups are statistically different can be done using a Log-Rank test.
surv_diff <- survdiff(Surv(time, status) ~ Age_group, data = data)
surv_diff
## Call:
## survdiff(formula = Surv(time, status) ~ Age_group, data = data)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## Age_group=0-4 yrs 45 8 6.367 0.419 0.608
## Age_group=5-9 yrs 36 16 8.745 6.020 9.730
## Age_group=10-14 yrs 31 0 7.987 7.987 12.143
## Age_group=15-18 yrs 7 0 0.902 0.902 0.950
##
## Chisq= 15.4 on 3 degrees of freedom, p= 0.001
The log rank test for difference in survival gives a p-value of p = 0.001, indicating that the age groups differ significantly in survival.
Sharma, Mohit. 2018. MULTINOMIAL LOGISTIC REGRESSION USING R. https://datasciencebeginners.com/2018/12/20/multinomial-logistic-regression-using-r/.
2004. Introduction to Survival Analysis. http://www.stat.columbia.edu/~madigan/W2025/notes/survival.pdf.
n.d. Survival Analysis Basics. STHDA. http://www.sthda.com/english/wiki/survival-analysis-basics.
n.d. MULTINOMIAL LOGISTIC REGRESSION | R DATA ANALYSIS EXAMPLES. UCLA:Statistical Consulting Group. https://stats.idre.ucla.edu/r/dae/multinomial-logistic-regression/.